Add plot_features_relevance #579

Merged
merged 8 commits into from
Mar 10, 2022

Conversation

Mr-Geekman
Contributor

@Mr-Geekman Mr-Geekman commented Mar 2, 2022

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Type of Change

  • Examples / docs / tutorials / contributors update
  • Bug fix (non-breaking change which fixes an issue)
  • Improvement (non-breaking change which improves an existing feature)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Proposed Changes

See #564.

Related Issue

#564.

Closing issues

Closes #564.

@Mr-Geekman Mr-Geekman added the enhancement New feature or request label Mar 2, 2022
@Mr-Geekman Mr-Geekman self-assigned this Mar 2, 2022
@codecov-commenter

codecov-commenter commented Mar 2, 2022

Codecov Report

Merging #579 (0e48462) into master (8b3c063) will decrease coverage by 0.44%.
The diff coverage is 26.19%.

@@            Coverage Diff             @@
##           master     #579      +/-   ##
==========================================
- Coverage   86.18%   85.74%   -0.45%     
==========================================
  Files         117      117              
  Lines        5719     5758      +39     
==========================================
+ Hits         4929     4937       +8     
- Misses        790      821      +31     
Impacted Files Coverage Δ
etna/analysis/plotters.py 17.11% <13.88%> (-0.40%) ⬇️
etna/analysis/__init__.py 100.00% <100.00%> (ø)
etna/analysis/feature_selection/__init__.py 100.00% <100.00%> (ø)
etna/analysis/feature_selection/mrmr.py 100.00% <100.00%> (ø)
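
The headline percentages in the coverage diff can be reproduced from the Hits and Lines rows. The snippet below is a minimal sketch that assumes coverage is simply hits divided by lines; the helper name `coverage_percent` is hypothetical, and Codecov's exact rounding is not stated in the report, so the last digit may differ.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Return coverage as a percentage (assumed definition: hits / lines)."""
    return 100.0 * hits / lines


# Figures taken from the Hits and Lines rows of the coverage diff.
master = coverage_percent(4929, 5719)
pull_request = coverage_percent(4937, 5758)
delta = pull_request - master

print(f"master: {master:.2f}%  PR: {pull_request:.2f}%  delta: {delta:+.2f}%")
```

Coverage drops even though 8 more lines are hit, because the PR adds 39 lines and 31 of them are not exercised by tests.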


@Mr-Geekman
Contributor Author

Script for demonstration:

from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from etna.analysis import StatisticsRelevanceTable
from etna.analysis import plot_feature_relevance
from etna.datasets import TSDataset


def simple_df_relevance() -> Tuple[pd.DataFrame, pd.DataFrame]:
    rng = np.random.default_rng(42)
    timestamp = pd.date_range("2021-01-01", "2021-02-01")

    df_1 = pd.DataFrame({"timestamp": timestamp, "target": np.arange(32), "segment": "1"})
    df_2 = pd.DataFrame({"timestamp": timestamp[5:], "target": np.arange(5, 32), "segment": "2"})
    df = pd.concat([df_1, df_2], ignore_index=True)
    df = TSDataset.to_dataset(df)

    timestamp = pd.date_range("2020-12-01", "2021-02-11")

    df_1 = pd.DataFrame(
        {
            "timestamp": timestamp,
            "regressor_1": np.arange(len(timestamp)),
            "regressor_2": np.zeros(len(timestamp)),
            "regressor_3": rng.normal(loc=0, scale=1.0, size=len(timestamp)),
            "regressor_4": rng.uniform(low=0, high=10.0, size=len(timestamp)),
            "regressor_5": rng.exponential(scale=1.0, size=len(timestamp)),
            "cat_feature": "hello",
            "segment": "1",
        }
    )
    df_2 = pd.DataFrame(
        {
            "timestamp": timestamp[5:],
            "regressor_1": np.sin(-np.arange(len(timestamp) - 5)),
            "regressor_2": np.log(np.arange(1, len(timestamp) - 4)),
            "regressor_3": rng.normal(loc=0, scale=2.0, size=len(timestamp) - 5),
            "regressor_4": rng.uniform(low=0, high=5.0, size=len(timestamp) - 5),
            "regressor_5": rng.exponential(scale=0.1, size=len(timestamp) - 5),
            "cat_feature": "bye",
            "segment": "2",
        }
    )
    df_exog = pd.concat([df_1, df_2], ignore_index=True)
    df_exog = TSDataset.to_dataset(df_exog)

    return df, df_exog


df, df_exog = simple_df_relevance()
ts = TSDataset(df=df, df_exog=df_exog, known_future="all", freq="D")
relevance_table = StatisticsRelevanceTable()

plot_feature_relevance(
    ts=ts,
    relevance_table=relevance_table,
    normalized=False,
    relevance_aggregation_mode="per-segment",
    top_k=None,
    segments=None,
    columns_num=2,
    figsize=(10, 5),
)

plt.savefig("per-segment.png")

plot_feature_relevance(
    ts=ts,
    relevance_table=relevance_table,
    normalized=False,
    relevance_aggregation_mode="mean",
    top_k=None,
    segments=None,
    columns_num=2,
    figsize=(10, 5),
)

plt.savefig("mean.png")

mean:
[Image: mean]

per-segment:
[Image: per-segment]

Collaborator

@alex-hse-repository alex-hse-repository left a comment

Could you also test this method with different combinations of the parameters (different relevance tables, normalization, top_k, aggregation mode)?
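
One way to enumerate such combinations is a Cartesian product over a small parameter grid. The snippet below is only a sketch: the keys mirror keyword arguments of `plot_feature_relevance` as used in the demonstration script in this thread, while `param_grid`, `iter_param_combinations`, and the grid values are hypothetical. The strings "statistics" and "model" are placeholders standing in for `StatisticsRelevanceTable` and `ModelRelevanceTable`, and the actual plotting call is omitted because it needs etna and a dataset.

```python
from itertools import product
from typing import Any, Dict, Iterator, List

# Hypothetical grid; values are illustrative, not an exhaustive test plan.
param_grid: Dict[str, List[Any]] = {
    "relevance_table": ["statistics", "model"],  # placeholder labels
    "normalized": [False, True],
    "top_k": [None, 5],
    "relevance_aggregation_mode": ["per-segment", "mean"],
}


def iter_param_combinations(grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per combination of the grid values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


combinations = list(iter_param_combinations(param_grid))
for params in combinations:
    # In a real check, each dict would be mapped to a plot_feature_relevance call.
    print(params)
```

With two values per parameter this yields 16 combinations; the follow-up script in this thread covers a hand-picked subset of them.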

@Mr-Geekman
Contributor Author

Script:

from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

from etna.analysis import ModelRelevanceTable
from etna.analysis import StatisticsRelevanceTable
from etna.analysis import plot_feature_relevance
from etna.datasets import TSDataset
from etna.transforms import DateFlagsTransform
from etna.transforms import FilterFeaturesTransform


def simple_df_relevance() -> Tuple[pd.DataFrame, pd.DataFrame]:
    rng = np.random.default_rng(42)
    timestamp = pd.date_range("2021-01-01", "2021-02-01")

    df_1 = pd.DataFrame({"timestamp": timestamp, "target": np.arange(32), "segment": "1"})
    df_2 = pd.DataFrame({"timestamp": timestamp[5:], "target": np.arange(5, 32), "segment": "2"})
    df = pd.concat([df_1, df_2], ignore_index=True)
    df = TSDataset.to_dataset(df)

    timestamp = pd.date_range("2020-12-01", "2021-02-11")

    df_1 = pd.DataFrame(
        {
            "timestamp": timestamp,
            "regressor_1": np.arange(len(timestamp)),
            "regressor_2": np.zeros(len(timestamp)),
            "regressor_3": rng.normal(loc=0, scale=1.0, size=len(timestamp)),
            "regressor_4": rng.uniform(low=0, high=10.0, size=len(timestamp)),
            "regressor_5": rng.exponential(scale=1.0, size=len(timestamp)),
            "cat_feature": "hello",
            "segment": "1",
        }
    )
    df_2 = pd.DataFrame(
        {
            "timestamp": timestamp[5:],
            "regressor_1": np.sin(-np.arange(len(timestamp) - 5)),
            "regressor_2": np.log(np.arange(1, len(timestamp) - 4)),
            "regressor_3": rng.normal(loc=0, scale=2.0, size=len(timestamp) - 5),
            "regressor_4": rng.uniform(low=0, high=5.0, size=len(timestamp) - 5),
            "regressor_5": rng.exponential(scale=0.1, size=len(timestamp) - 5),
            "cat_feature": "bye",
            "segment": "2",
        }
    )
    df_exog = pd.concat([df_1, df_2], ignore_index=True)
    df_exog = TSDataset.to_dataset(df_exog)

    return df, df_exog


def main():
    df, df_exog = simple_df_relevance()
    ts = TSDataset(df=df, df_exog=df_exog, known_future="all", freq="D")
    statistics_relevance_table = StatisticsRelevanceTable()
    model_relevance = RandomForestRegressor(n_estimators=100, random_state=42)
    model_relevance_table = ModelRelevanceTable()

    plot_feature_relevance(
        ts=ts,
        relevance_table=statistics_relevance_table,
        normalized=False,
        relevance_aggregation_mode="per-segment",
        top_k=None,
        segments=None,
        columns_num=2,
        figsize=(10, 5),
    )

    plt.savefig("per-segment.png")

    plot_feature_relevance(
        ts=ts,
        relevance_table=statistics_relevance_table,
        normalized=False,
        relevance_aggregation_mode="mean",
        top_k=None,
        segments=None,
        columns_num=2,
        figsize=(10, 5),
    )

    plt.savefig("mean.png")

    transforms = [
        FilterFeaturesTransform(exclude=["cat_feature"]),
        DateFlagsTransform(day_number_in_week=True, day_number_in_month=True, is_weekend=True, out_column="date_flag"),
    ]
    ts.fit_transform(transforms)

    plot_feature_relevance(
        ts=ts,
        relevance_table=model_relevance_table,
        normalized=False,
        relevance_aggregation_mode="per-segment",
        relevance_params={"model": model_relevance},
        top_k=5,
        segments=None,
        columns_num=2,
        figsize=(10, 5),
    )

    plt.savefig("random_forest_per_segment.png")

    plot_feature_relevance(
        ts=ts,
        relevance_table=model_relevance_table,
        normalized=False,
        relevance_aggregation_mode="mean",
        relevance_params={"model": model_relevance},
        top_k=5,
        segments=None,
        columns_num=2,
        figsize=(10, 5),
    )

    plt.savefig("random_forest_mean.png")

    plot_feature_relevance(
        ts=ts,
        relevance_table=model_relevance_table,
        normalized=True,
        relevance_aggregation_mode="mean",
        relevance_params={"model": model_relevance},
        top_k=5,
        segments=None,
        columns_num=2,
        figsize=(10, 5),
    )

    plt.savefig("random_forest_mean_normalized.png")


if __name__ == "__main__":
    main()

Statistics relevance table, per segment:
[Image: per-segment]

Statistics relevance table, mean aggregation:
[Image: mean]

Model relevance table, per segment:
[Image: random_forest_per_segment]

Model relevance table, mean aggregation:
[Image: random_forest_mean]

Model relevance table, mean aggregation with normalization:
[Image: random_forest_mean_normalized]

@alex-hse-repository alex-hse-repository merged commit a0a84fd into master Mar 10, 2022
@iKintosh iKintosh deleted the issue-564 branch March 22, 2022 08:39
Development

Successfully merging this pull request may close these issues.

Feature relevance visualisation